00010									June 26, 1973
00020	
00030			 A Proposal for Speech Understanding Research
00040	
00050	
00060		It is proposed that the work on speech recognition that is
00070	now under way in the A.I. project at Stanford University be continued
00080	and extended with broadened aims in the field
00090	of speech understanding. This work gives considerable promise both of
00100	solving some of the immediate problems that beset speech
00110	understanding research and of providing a basis for future advances.
00120	
00130		It is further proposed that this work be more closely tied to
00140	the ARPA Speech Understanding Research effort than it has been in the
00150	past and that it have as its express aim the study and application to
00160	speech recognition of a machine learning process that has proved
00170	highly successful in another application and that has already been
00180	tested out to a limited extent in speech recognition. The machine
00190	learning process offers both an automatic training scheme and the
00200	inherent ability of the system to adapt to various speakers and
00210	dialects. Speech recognition via machine learning represents a global
00220	approach to the speech recognition problem and can be incorporated
00230	into a wide class of limited vocabulary systems.
00240	
00250		Finally, we would propose accepting responsibility for keeping
00260	other ARPA projects supplied with operating versions of the best
00270	current programs that we have developed. The availability of the high
00280	quality front end that the signature table approach provides would 
00290	enable designers of the various over-all systems
00300	to test the relative performance of the top-down portions of their
00310	systems without having to make allowances for the deficiencies
00320	of their currently available front ends. Indeed, if the signature table
00330	scheme can be made simple enough to compete on a time basis (and we
00340	believe that it can) then it may replace the other front end
00350	schemes that are currently in favor.
00360	
00370		Stanford University is well suited as the site for such work,
00380	having both the facilities for this work and a staff of people with
00390	experience and interest in machine learning, phonetic analysis, and
00400	digital signal processing. The staff at present consists of the
00410	proposed Principal Investigator, Arthur L. Samuel; one post-doctoral
00420	staff member, Ravindra Thosar, who
00430	has worked on speech recognition and synthesis in India; a
00440	second member, Dr. Neil Miller, who has had considerable signal-processing
00450	experience; and a few graduate students. It is anticipated that a staff
00460	of not more than 3 full-time members, with the help of 3 or 4 graduate
00470	students, could mount a meaningful program, which should be funded for a
00480	minimum of two years to ensure continuity of effort.
00490	We would expect to demonstrate the utility of the
00500	Signature Table approach within this time span and to provide a working
00510	system that could be used as the front end for any of the
00520	speech understanding systems that are currently under
00530	development or are being planned.
00550	
00560		Ultimately we would
00570	like to have a system capable of understanding speech from an
00580	unlimited domain of discourse and with an unknown speaker. It seems not
00590	unreasonable to expect the system to deal with this situation very
00600	much as people do when they adapt their understanding processes to
00610	the speaker's idiosyncrasies during the conversation. The signature table
00620	method gives promise of contributing toward the solution of this
00630	problem as well as being a
00640	possible answer to some of the more immediate problems.
00650	
00660		The initial thrust of the proposed work would be toward the
00670	development of adaptive learning techniques, using the signature
00680	table method and some more recent variants and extensions of this
00690	basic procedure. We have already demonstrated the usefulness of this
00700	method for the initial assignment of significant features to the
00710	acoustic signals. One of the next steps will be to extend the method
00720	to include acoustic-phonetic probabilities in the decision process.
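
	As a rough illustration of how acoustic-phonetic probabilities
might enter the decision process, the sketch below (Python; the names,
numbers, and the phone-pair formulation are purely hypothetical and do
not describe the present program) reweights the table-derived scores
for a segment by probabilities conditioned on the previous decision:

	# Hypothetical sketch: combine signature-table scores with
	# acoustic-phonetic (phone-pair) probabilities for one segment.
	def rescore(segment_scores, pair_prob, prev_phone):
	    """segment_scores: phone -> score from the signature tables.
	    pair_prob: (previous phone, phone) -> acoustic-phonetic probability.
	    Returns the phones re-ranked by the combined evidence."""
	    combined = {
	        phone: score * pair_prob.get((prev_phone, phone), 0.01)  # floor for unseen pairs
	        for phone, score in segment_scores.items()
	    }
	    return sorted(combined, key=combined.get, reverse=True)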
00730	
00740		Still another aspect to be studied would be the amount of
00750	preprocessing that should be done and the desired balance between
00760	bottom-up and top-down approaches. It is fairly obvious that
00770	decisions of this sort should ideally be made dynamically depending
00780	upon the familiarity of the system with the domain of
00790	discourse and with the characteristics of the speaker.
00800	Compromises will undoubtedly have to be made in any immediately
00810	realizable system but we should understand better than we now do the
00820	limitations on the system that such compromises impose.
00830	
00840		It may be well at this point to describe the general
00850	philosophy that has been followed in the work that is currently under
00860	way and the results that have been achieved to date. We have been
00870	studying elements of a speech recognition system that is not
00880	dependent upon the use of a limited vocabulary and that can recognize
00890	continuous speech by a number of different speakers.
00900	
00910		Such a system should be able to function successfully either
00920	without any previous training for the specific speaker in question or
00930	after a short training session in which the speaker would be asked to
00940	repeat certain phrases designed to train the system on those phonetic
00950	utterances that seemed to depart from the previously learned norm. In
00960	either case it is believed that some automatic or semi-automatic
00970	training system should be employed to acquire the data that is used
00980	for the identification of the phonetic information in the speech. We
00990	believe that this can best be done by employing a modification of the
01000	signature table scheme previously described. A brief review of this
01010	earlier form of signature table is given in Appendix 1.
01020	
01030		The over-all system is envisioned as one in which the more or
01040	less conventional method of separating the input speech into
01050	short time slices is used, for each of which some sort of frequency
01060	analysis (homomorphic, LPC, or the like) is done. We then interpret this
01070	information in terms of significant features by means of a set of
01080	signature tables. At this point we define longer sections of the
01090	speech called segments which are obtained by grouping together varying
01100	numbers of the original slices on the basis of their similarity. This
01110	then takes the place of other forms of initial segmentation. Having
01120	identified a series of segments in this way, we next use another set of
01130	signature tables to extract information from the sequence of segments
01140	and combine it with a limited amount of syntactic and semantic
01150	information to define a sequence of phonemes.
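
	To make the grouping step concrete, the following sketch (Python;
our own illustration of one simple similarity criterion, not the
procedure actually used) merges adjacent slices into a segment while
their feature vectors remain sufficiently alike:

	# Illustrative only: group adjacent time slices into segments by
	# thresholding the mean absolute difference of their feature vectors.
	def group_slices(slices, threshold=0.2):
	    """slices: list of feature vectors (lists of floats), one per slice.
	    Returns a list of (start, end) index pairs, one per segment."""
	    def distance(a, b):
	        return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

	    segments, start = [], 0
	    for i in range(1, len(slices)):
	        if distance(slices[i], slices[i - 1]) > threshold:
	            segments.append((start, i))   # similarity broke: close a segment
	            start = i
	    segments.append((start, len(slices)))
	    return segments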
01160	
01170		While it would be possible to extend this bottom-up approach
01180	still further, it seems reasonable to break off at this point and
01190	revert to a top-down approach from here on. The real difference in
01200	the overall system would then be that the top-down analysis would
01210	deal with the outputs from the signature table section as its
01220	primitives rather than with the outputs from the initial measurements
01230	either in the time domain or in the frequency domain. In the case of
01240	inconsistencies the system could either refer to the second choices
01250	retained within the signature tables or if need be could always go
01260	clear back to the input parameters. The decision as to how far to
01270	carry the initial bottom-up analysis must depend upon the relative
01280	cost of this analysis both in complexity and processing time and the
01290	certainty with which it can be performed as compared with the costs
01300	associated with the rest of the analysis and the certainty with which
01310	it can be performed, taking due notice of the costs in time of
01320	recovering from false starts.
01330	
01340		Signature tables can be used to perform four essential
01350	functions that are required in the automatic recognition of speech.
01360	These functions are: (1) the elimination of superfluous and
01370	redundant information from the acoustic input stream, (2) the
01380	transformation of the remaining information from one coordinate
01390	system to a more phonetically meaningful coordinate system, (3) the
01400	mixing of acoustically derived data with syntactic, semantic and
01410	linguistic information to obtain the desired recognition, and (4) the
01420	introduction of a learning mechanism.
01430	
01440		The following three advantages emerge from this method of
01450	training and evaluation.
01460		1) Essentially arbitrary inter-relationships between the
01470	input terms are taken into account by any one table. The only loss of
01480	accuracy is in the quantization.
01490		2) The training is a very simple process of accumulating
01500	counts. The training samples are introduced sequentially, and hence
01510	simultaneous storage of all the samples is not required.
01520		3) The process linearizes the storage requirements in the
01530	parameter space.
01540	
01550		The signature tables, as used in speech recognition, must be
01560	particularized to allow for the multi-category nature of the output.
01570	Several forms of tables have been investigated. Details of the current
01580	system are given in Appendix 2. For some early results see
01590	SUR Note 43, "Some Preliminary Experiments in Speech Recognition
01600	Using Signature Tables," by R. B. Thosar and A. L. Samuel.
01620	
01630		Work is currently under way on a major refinement of the
01640	signature table approach which adopts a somewhat more rigorous
01650	procedure. Preliminary results with this scheme indicate that a
01660	substantial improvement has been achieved. This effort is described in
01670	a recent report, SUR Note 81, "Estimation of Probability Density Using
01680	Signature Tables for Application to Pattern Recognition," by
01690	R. B. Thosar.
01700	
01720		We are currently involved in work on a segmentation
01730	procedure which has already demonstrated its ability to compete with other
01740	proposed segmentation systems, even when used to process speech from 
01750	speakers whose utterances were not used during the training
01760	sequence.
     

00010	FACILITIES
00020	
00030	The computer  facilities  of  the  Stanford  Artificial  Intelligence
00040	Laboratory include the following equipment.
00050	
00060	Central Processors:  Digital Equipment Corporation PDP-10 and PDP-6
00070	
00080	Primary Store:       65K words of 1.7 microsecond DEC Core
00090		             65K words of 1 microsecond Ampex Core
00100	                     131K words of 1.6 microsecond Ampex Core
00110	
00120	Swapping Store:      Librascope disk (5 million words, 22 million
00130	                     bits/second transfer rate)
00140	
00150	File Store:          IBM 3330 disc file, 6 spindles (leased)
00160	
00170	Peripherals:         4 DECtape drives, 2 mag tape drives, line printer,
00180		             Calcomp plotter, Xerox Graphics Printer
00190	
00200	Communications
00210	    Processor:	     BBN IMP (Honeywell DDP-516) connected to the
00220			     ARPA network.
00230	
00240	Terminals:           58 TV displays, 6 III displays, 3 IMLAC displays,
00250		             1 ARDS display, 15 Teletype terminals
00260	
00270	Special  Equipment:  Audio  input  and  output  systems, hand-eye
00280	                     equipment (2 TV cameras, 3 arms), remote-
00290	                     controlled cart
     

00010	   		RESEARCH GRANT BUDGET
00020			
00030			TWO YEARS BEGINNING OCTOBER 1, 1973
00040	
00050	
00060	BUDGET CATEGORY					YEAR 1	YEAR 2
00070	-----------------------------------------------------------------
00080	I. SALARIES & WAGES:
00090		
00100		Samuel, A.,
00110		Senior Research Associate
00120		Principal Investigator, 75%		 20,000	 20,000
00130	
00140		------,
00150		Research Associate			 14,520	 14,520
00160	
00170		Miller, N.,
00180		Research Associate			 13,680	 13,680
00190	
00200		------,
00210		Student Research Assistant,
00220		50% academic year, 100% summer		  4,914	  5,070
00230	
00240		------,
00250		Student Research Assistant,
00260		50% academic year, 100% summer		  4,914	  5,070
00270	
00280		Reserve for Salary Increases
00290		@ 5.5% per year				  2,901	  5,980
00300							-------	-------
00310	
00320		TOTAL SALARIES AND WAGES		$60,929 $64,320
00330	
00340	II. STAFF BENEFITS:
00350	
00360		17.0% 10-1-73 to 8-31-74		  9,495
00370		18.3% 9-1-74 to 8-31-75			    929  10,790
00380		19.3% 9-1-75 to 9-30-75				  1,034
00390							-------	-------
00400		TOTAL STAFF BENEFITS			$10,424 $11,824
00410	
00420	III. TRAVEL:
00430	
00440		Domestic -
00450			Local		150
00460			East Coast	450
00470					---
00480							   $600    $600
00490	
00500	IV.  EXPENDABLE MATERIALS & SERVICES:
00510	
00520		A. Telephone Service	480
00530		B. Office Supplies	600
00540					---
00550							 $1,080  $1,080
00560	
00570	V.  PUBLICATIONS COST:
00580	
00590		2 Papers @ 500 ea.			 $1,000  $1,000
00600							------- -------
00610	
00620	VI. TOTAL DIRECT COSTS:
00630	
00640		(Items I through V)			$74,033 $78,824
00650	
00660	VII. INDIRECT COSTS:
00670	
00680		On Campus - 47% of NTDC			$34,796 $37,047
00690	
00700							-------	-------
00710	VIII. TOTAL COSTS:
00720	
00730		(Items VI + VII)		       $108,829 $115,871          
00740						       -------- --------
     

00010	COGNIZANT PERSONNEL
00020	
00030	
00040	        For contractual matters:
00050	
00060	                Office of the Research Administrator
00070	                Stanford University
00080	                Stanford, California 94305
00090	
00100	                Telephone: (415) 321-2300, ext. 3330
00110	
00120	        For technical and scientific matters regarding this proposal:
00130	
00140	                Arthur L. Samuel
00150	                Computer Science Department
00160	                Stanford University
00170	                Stanford, California 94305
00180	
00190	                Telephone: (415) 321-2300, ext. 4971
00200	
00210	        For administrative matters, including questions relating
00220	        to the budget or property acquisition:
00230	
00240	                Mr. Lester D. Earnest
00250	                Computer Science Department
00260	                Stanford University
00270	                Stanford, California 94305
00280	
00290	                Telephone: (415) 321-2300, ext. 4971
     

00010	
00020			Appendix 1
00030	
00040		The early form of a signature table
00050	
00060		For those not familiar with the use of signature tables as
00070	used by Samuel in programs which played the game of checkers, the
00080	concept is best illustrated (Fig.1) by an arrangement of tables used
00090	in the program. There are 27 input terms. Each term evaluates a
00100	specific aspect of a board situation and is quantized into a
00110	limited but adequate range of values (7, 5, and 3 levels in this case). The
00120	terms are divided into 9 sets with 3 terms each, forming the 9 first
00130	level tables. Outputs from the first level tables are quantized to 5
00140	levels and combined into 3 second level tables and, finally, into one
00150	third-level table whose output represents the figure of merit of the
00160	board in question.
00170	
00180		A signature table has an entry for every possible combination
00190	of the input vector. Thus there are 7*5*3 or 105 entries in each of
00200	the first level tables. Training consists of accumulating two counts
00210	for each entry during a training sequence. Count A is incremented
00220	when the current input vector represents a preferred move and count D
00230	is incremented when it is not the preferred move. The output from the
00240	table is computed as a correlation coefficient
00250	 			C=(A-D)/(A+D).
00260		The figure of merit for a board is simply the
00270	coefficient obtained as the output from the final table.
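
	The following sketch (Python; illustrative only, not the checkers
program itself) shows a single first-level table of the kind just
described, with inputs quantized to 7, 5, and 3 levels, the A and D
counts, and the output C = (A-D)/(A+D):

	# Sketch of the early form of signature table described above.
	class SignatureTable:
	    def __init__(self, ranges=(7, 5, 3)):
	        self.ranges = ranges
	        size = 1
	        for r in ranges:
	            size *= r                  # 7*5*3 = 105 entries
	        self.A = [0] * size            # counts: input vector was the preferred move
	        self.D = [0] * size            # counts: input vector was not preferred

	    def index(self, terms):
	        # terms: one quantized value per input term
	        i = 0
	        for value, r in zip(terms, self.ranges):
	            i = i * r + value
	        return i

	    def train(self, terms, preferred):
	        counts = self.A if preferred else self.D
	        counts[self.index(terms)] += 1

	    def output(self, terms):
	        a, d = self.A[self.index(terms)], self.D[self.index(terms)]
	        return 0.0 if a + d == 0 else (a - d) / (a + d)   # C = (A-D)/(A+D)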
     

00010			Appendix 2
00020	
00030		Initial Form of Signature Table for Speech Recognition
00040	
00050		The signature tables, as used in speech recognition, must be
00060	particularized to allow for the multi-category nature of the output.
00070	Several forms of tables have been investigated. The initial form
00080	tested and used for the data presented in the attached paper uses
00090	tables consisting of two parts, a preamble and the table proper. The
00100	preamble contains: (1) space for saving a record of the current and
00110	recent output reports from the table, (2) identifying information as
00120	to the specific type of table, (3) a parameter that identifies the
00130	desired output from the table and that is used in the learning
00140	process, (4) a gating parameter specifying the input that is to be
00150	used to gate the table, (5) the sign of the gate,
00160	(6) the gating level to be used, and (7)
00170	parameters that identify the sources of the normal inputs to the
00180	table.
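
	As one concrete reading of this layout, the sketch below (Python;
the field names are ours and purely illustrative) records the seven
preamble items followed by the table proper:

	# Hypothetical layout of a speech signature table and its preamble.
	from dataclasses import dataclass, field

	@dataclass
	class TablePreamble:
	    recent_outputs: list   # (1) current and recent output reports
	    table_type: str        # (2) identifying information for the table
	    desired_output: str    # (3) output the table is trained to produce
	    gate_input: int        # (4) which input gates the table
	    gate_sign: int         # (5) sign of the gate
	    gate_level: int        # (6) gating level
	    input_sources: list    # (7) sources of the normal inputs

	@dataclass
	class SpeechSignatureTable:
	    preamble: TablePreamble
	    entries: list = field(default_factory=list)  # the table proper: one
	                                                 # line per input combination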
00190	
00200		All inputs are limited in range and specify either the
00210	absolute level of some basic property or, more usually, the probability
00220	of some property being present. These inputs may be from the original
00230	acoustic input or they may be the outputs of other tables. If from
00240	other tables they may be for the current time step or for earlier
00250	time steps (subject to practical limits as to the number of time
00260	steps that are saved).
00270	
00280		The output, or outputs, from each table are similarly limited
00290	in range and specify, in all cases, a probability that some
00300	particular significant feature, phonette, phoneme, word segment, word
00310	or phrase is present.
00320	
00330		We are limiting the range of inputs and outputs to values
00340	specified by 3 bits and the number of entries per table to 64
00350	although this choice of values is a matter to be determined by
00360	experiment. We are also providing for any of the following input
00370	combinations: (1) one input of 6 bits, (2) two inputs of 3 bits each,
00380	(3) three inputs of 2 bits each, and (4) six inputs of 1 bit each.
00390	The uses to which these different forms are put will be described
00400	later.
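
	The four combinations all address the same 64-entry table because
the bit widths always sum to 6. A minimal sketch of the index packing
(Python; our illustration of the arithmetic, not the program's code):

	# Pack one, two, three, or six small inputs into a single 6-bit index.
	def entry_index(inputs, bits_per_input):
	    """inputs: tuple of small integers; bits_per_input: 6, 3, 2, or 1.
	    Returns an index in the range 0..63."""
	    assert len(inputs) * bits_per_input == 6
	    index = 0
	    for value in inputs:
	        assert 0 <= value < (1 << bits_per_input)
	        index = (index << bits_per_input) | value
	    return index

	# e.g. entry_index((5, 3), 3) packs two 3-bit inputs into one of 64 entries.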
00410	
00420		The body of each table contains entries corresponding to
00430	every possible combination of the allowed input parameters. Each
00440	entry in the table actually consists of several parts. There are
00450	fields assigned to accumulate counts of the occurrences of incidents
00460	in which the specifying input values coincided with the different
00470	desired outputs from the table, as found during previous learning
00480	sessions, and there are fields containing the summarized results of
00490	these learning sessions, which are used as outputs from the table.
00500	The outputs from the tables can then express to the allowed accuracy
00510	all possible functions of the input parameters.
00520	
00530	Operation in the Training Mode
00540	
00550		When operating in the training mode the program is supplied
00560	with a sequence of stored utterances with accompanying phonetic
00570	transcriptions. Each sample of the incoming speech signal is
00580	analysed (Fourier transforms or inverse filter equivalent) to obtain
00590	the necessary input parameters for the lowest level tables in the
00600	signature table hierarchy. At the same time reference is made to a
00610	table of phonetic "hints" which prescribes the desired outputs from
00620	each table corresponding to all possible phonemic inputs. The
00630	signature tables are then processed.
00640	
00650		The processing of each table is done in two steps, one
00660	performed at each entry to the table and the second only periodically.
00670	The first process consists of locating a single entry line within the
00680	table as specified by the inputs to the table and adding a 1 to the
00690	appropriate field to indicate the presence of the property specified
00700	by the hint table as corresponding to the phoneme specified in the
00710	phonemic transcription. At this time a report is also made as to the
00720	table's output as determined from the averaged results of previous
00730	learning so that a running record may be kept of the performance of
00740	the system. At periodic intervals all tables are updated to
00750	incorporate recent learning results. To make this process easily
00760	understandable, let us restrict our attention to a table used to
00770	identify a single significant feature, say voicing. The hint table
00780	will identify whether or not the phoneme currently being processed is
00790	to be considered voiced. If it is voiced, a 1 is added to the "yes"
00800	field of the entry line located by the normal inputs to the table. If
00810	it is not voiced, a 1 is added to the "no" field. At updating time
00820	the output that this entry will subsequently report is determined by
00830	dividing the accumulated sum in the "yes" field by the sum of the
00840	numbers in the "yes" and the "no" fields, and reporting this quantity
00850	as a number in the range from 0 to 7. Actually the process is a bit
00860	more complicated than this and it varies with the exact type of table
00870	under consideration, as reported in detail in appendix B. Outputs
00880	from the signature tables are not probabilities, in the strict sense,
00890	but are the statistically-arrived-at odds based on the actual
00900	learning sequence.
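
	A bare-bones sketch of this two-step procedure for a single-feature
(voicing) table follows (Python; the surrounding structure is our
assumption, and only the yes/no counts and the 0-to-7 output scaling
are taken from the description above):

	# Sketch of training and periodic updating for one feature table.
	class FeatureTable:
	    def __init__(self, entries=64):
	        self.yes = [0] * entries       # hint said the feature was present
	        self.no = [0] * entries        # hint said it was absent
	        self.out = [0] * entries       # reported outputs, in the range 0..7

	    def train(self, entry, voiced):
	        # first step: performed at every entry to the table
	        if voiced:
	            self.yes[entry] += 1
	        else:
	            self.no[entry] += 1

	    def update(self):
	        # second step: performed only periodically
	        for i in range(len(self.out)):
	            total = self.yes[i] + self.no[i]
	            if total:
	                self.out[i] = round(7 * self.yes[i] / total)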
00910	
00920		The preamble of the table has space for storing twelve past
00930	outputs. An input to a table can be delayed to that extent. This table
00940	relates outcomes of previous events with the present hint (the
00950	learning input). A certain amount of context-dependent learning is thus
00960	possible with the limitation that the specified delays are constant.
00970	
00980		The interconnected hierarchy of tables forms a network which
00990	runs incrementally, in steps synchronous with the time window over which
01000	the input signal is analysed. The present window width is set at 12.8
01010	ms (256 points at 20K samples/sec) with an overlap of 6.4 ms. Inputs
01020	to this network are the parameters abstracted from the frequency
01030	analyses of the signal, and the specified hint. The outputs of the
01040	network could be either the probability attached to every phonetic
01050	symbol or the output of a table associated with a feature such as
01060	voiced, vowel, etc. The point to be made is that the output generated
01070	for a sample is essentially independent of its contiguous
01080	samples. The dependency achieved by using delays in the inputs is
01090	invisible to the outputs. The outputs thus report the best estimate of
01100	what the current acoustic input is with no relation to the past
01110	outputs. Relating the successive outputs along the time dimension is
01120	realised by counters.
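
	The framing itself is straightforward; a sketch of the stated
window arrangement (Python; illustrative only) is:

	# 256-sample (12.8 ms at 20K samples/sec) windows with 6.4 ms
	# (128-sample) overlap; each frame feeds the frequency analysis.
	def frames(samples, width=256, overlap=128):
	    step = width - overlap
	    for start in range(0, len(samples) - width + 1, step):
	        yield samples[start:start + width]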
01130	
01140	The Use of COUNTERS
01150	
01160		The transition from initial sample space to segment space is
01170	made possible by means of COUNTERS which are summed and reinitiated
01180	whenever their inputs cross specified threshold values, being
01190	triggered on when the input exceeds the threshold and off when it
01200	falls below. Momentary spikes are eliminated by specifying time
01210	hysteresis, the number of consecutive samples for which the input
01220	must be above the threshold. The output of a counter provides
01230	information about starting time, duration and average input for the
01240	period it was active.
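
	A simplified counter of this kind might look as follows (Python; a
sketch of the behaviour described above rather than the actual
mechanism, and the hysteresis handling in particular is our own
simplification):

	# Runs of above-threshold input shorter than `hysteresis` samples are
	# discarded as momentary spikes; longer runs are reported with their
	# starting time, duration, and average input.
	class Counter:
	    def __init__(self, threshold, hysteresis):
	        self.threshold = threshold
	        self.hysteresis = hysteresis
	        self.start = None
	        self.values = []

	    def step(self, t, value):
	        """Feed one input sample; returns (start, duration, average)
	        when an above-threshold run ends, otherwise None."""
	        if value >= self.threshold:
	            if self.start is None:
	                self.start = t
	            self.values.append(value)
	            return None
	        report = None
	        if self.start is not None and len(self.values) >= self.hysteresis:
	            report = (self.start, len(self.values),
	                      sum(self.values) / len(self.values))
	        self.start, self.values = None, []
	        return report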
01250	
01260		Since a counter can reference a table at any level in the
01270	hierarchy of tables, it can reflect any desired degree of information
01280	reduction. For example, a counter may be set up to show a section of
01290	speech to be a vowel, a front vowel or the vowel /I/. The counters can
01300	be looked upon as representing a mapping of parameter-time space into a
01310	feature-time space or, at a higher level, symbol-time space. It may be
01320	useful to carry along the feature information as a backup in those
01330	situations where the symbolic information is not acceptable to
01340	syntactic or semantic interpretation.
01350	
01360		In the same manner as the tables, the counters run completely
01370	independently of each other. In a recognition run the counters may
01380	overlap in arbitrary fashion, may leave out gaps where no counter has
01390	been triggered or may not line up nicely. A properly segmented output,
01400	where the consecutive sections are in time sequence and are neatly
01410	labeled, is essential for further processing. This is achieved by
01420	registering the instants when the counters are triggered or
01430	terminated to form time slices called segments.
01440	
01450		An event is the period between successive activation or
01460	termination of any counter. An event shorter than a specified time is
01470	merely ignored. A record of event durations and up to three active
01480	counters, ordered according to their probability, is maintained.
01490	
01500		An event resulting from the processing described so far
01510	represents a phonette - one of the basic speech categories defined as
01520	hints in the learning process. It is only an estimate of closeness to
01530	a speech category, based on past learning. Also, each category has a
01540	more-or-less stationary spectral characterisation. Thus a category may
01550	have a phonemic equivalent, as in the case of vowels; it may be
01560	common to a phoneme class, as for the voiced or unvoiced stop gaps; or it
01570	may be subphonemic, as in a T-burst or a K-burst. The choices are based on
01580	acoustic expediency, i.e. optimisation of the learning, rather than
01590	any linguistic considerations. However, a higher-level interpretive
01600	program may best operate on inputs resembling phonemic
01610	transcription. The contiguous segments may be coalesced into phoneme-like
01620	units using dyadic or triadic probabilities and acoustic-phonetic
01630	rules particular to the system. For example, a period of silence
01640	followed by a type of burst or a short friction may be combined to
01650	form the corresponding stop. A short friction or a burst following a
01660	nasal or a lateral may be called a stop even if the silence period is
01670	short or absent. Clearly these rules must be specific to the system,
01680	based on the confidence with which durations and phonette categories
01690	are recognised.
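
	By way of illustration, rules of this kind might be coded as
follows (Python; the category names are invented for the example, the
rules are deliberately oversimplified, and any real set would have to
be tuned to the system as noted above):

	# Coalesce a labeled segment sequence into phoneme-like units:
	# silence + burst/friction becomes a stop, and a burst or short
	# friction directly after a nasal or lateral is also called a stop.
	def coalesce_stops(segments):
	    """segments: list of (label, duration_ms) pairs, in time order."""
	    bursts = ("t-burst", "k-burst", "friction")
	    out, i = [], 0
	    while i < len(segments):
	        label, dur = segments[i]
	        nxt = segments[i + 1][0] if i + 1 < len(segments) else None
	        if label == "silence" and nxt in bursts:
	            out.append(("stop", dur + segments[i + 1][1]))
	            i += 2
	        elif label in bursts and out and out[-1][0] in ("nasal", "lateral"):
	            out.append(("stop", dur))   # silence short or absent
	            i += 1
	        else:
	            out.append((label, dur))
	            i += 1
	    return out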